Designing optimal behavioral experiments using machine learning

Computational models are powerful tools for understanding human cognition and behavior. They let us express our theories clearly and precisely and offer predictions that can be subtle and often counter-intuitive. However, this same richness and ability to surprise means our scientific intuitions and traditional tools are ill-suited to designing experiments to test and compare these models. To avoid these pitfalls and realize the full potential of computational modeling, we require tools to design experiments that provide clear answers about which models explain human behavior and the auxiliary assumptions those models must make. Bayesian optimal experimental design (BOED) formalizes the search for optimal experimental designs by identifying experiments that are expected to yield informative data. In this work, we provide a tutorial on leveraging recent advances in BOED and machine learning to find optimal experiments for any kind of model from which we can simulate data, and show how by-products of this procedure allow for quick and straightforward evaluation of models and their parameters against real experimental data. As a case study, we consider theories of how people balance exploration and exploitation in multi-armed bandit decision-making tasks. We validate the presented approach using simulations and a real-world experiment. Compared to experimental designs commonly used in the literature, we show that our optimal designs more efficiently determine which of a set of models best accounts for individual human behavior, and more efficiently characterize behavior given a preferred model. At the same time, formalizing a scientific question such that it can be adequately addressed with BOED can be challenging, and we discuss several potential caveats and pitfalls that practitioners should be aware of. We provide code to replicate all analyses, as well as tutorial notebooks and pointers for adapting the methodology to different experimental settings.


Introduction
Computational modeling of behavioral phenomena is currently experiencing rapid growth, in particular with respect to methodological improvements. For instance, seminal work by Wilson and Collins (2019) has been crucial in raising methodological standards and bringing attention to how computational analyses can add value to the study of human (and animal) behavior. Meanwhile, in most instances, computational analyses are applied to data collected from experiments that were designed based on intuition and convention. That is, experimental designs are often chosen without explicitly considering how informative the resulting data might be for the computational analyses and, ultimately, the scientific question being studied. This can, in the worst case, completely undermine the research effort, especially as particularly valuable experimental designs can be counter-intuitive. Today, advancements in machine learning open up the possibility of applying computational methods when deciding how experiments should be designed in the first place, to yield data that are maximally informative with respect to the scientific question at hand. In this work, we provide a step-by-step tutorial on how these advancements can be combined and leveraged to find optimal experimental designs for any computational model from which we can simulate data, and show how by-products of this procedure allow for quick and straightforward evaluation of models and their parameters against real experimental data.

Why optimize experimental designs?
Experiments are the bedrock of scientific data collection, making it possible to discriminate between different theories or models, and to discover what specific commitments a model must make to align with reality by means of parameter estimation. The process of designing experiments involves making a set of complex decisions, where the space of possible experimental designs is typically navigated based on a combination of researchers' prior experience, intuitions about what designs may be informative, and convention. This approach has been successful for centuries across various scientific fields, but it has important limitations that are amplified as theories become richer, more nuanced and more complex. Often, a theory does not make a concrete prediction, but rather has free parameters that license a variety of possible predictions of varying plausibility. For example, if we want to model human behavior, it is important to recognize that different participants in a task might have different strategies or priorities, which can be captured by parameters in a model. As we expand models to accommodate more complex effects, it becomes ever more time-consuming and difficult to design experiments that distinguish between models, or allow for effective parameter estimation. As an example from cognitive science, there are instances where models turned out to be empirically indistinguishable under certain conditions, a fact that was only recognized after a large number of experiments had already been done (Jones and Dzhafarov, 2014). Such problems also play an important role in the larger context of the replication crisis (e.g., Pashler and Wagenmakers, 2012). Collecting experimental data is typically a costly and time-consuming process, especially in the case of studies involving resource-intensive recording techniques, such as neuroimaging or eye-tracking (e.g., Dale, 1999; Lorenz et al., 2016). There is a strong economic as well as ethical case for conducting experiments that are expected to lead to maximally informative data with respect to the scientific question at hand. For instance, in computational psychiatry, researchers are often interested in estimating parameters describing people's traits and how they relate, often doing so under constrained resources in terms of the number of participants and their time. Additionally, researchers and practitioners alike require that experiments are sufficiently informative to provide confidence that they would reveal effects if they are there and that negative results are true negatives (e.g., Huys et al., 2013; Karvelis et al., 2018). Consequently, as theories of natural phenomena become more realistic and complex, researchers must expand their methodological toolkit for designing scientific experiments.

Bayesian optimal experimental design
As a potential solution, Bayesian optimal experimental design (BOED; see Ryan et al., 2016, for a review) provides a principled framework for optimizing the design of experiments. Broadly speaking, BOED rephrases the task of finding optimal experimental designs to that of solving an optimization problem. That is, the researcher specifies all controllable parameters of an experiment, also known as experimental designs, for which they wish to find optimal settings, which are then determined by maximizing a utility function. Exactly what we might want to optimize depends on our experiment, but can include any aspect of the design that we can specify, such as which stimuli to present, when or where measurements should be taken in a quasi-experiment, or, as an example, how to reward risky choices in a behavioral experiment. The utility function is selected to measure the quality of a given experimental design with respect to the scientific goal at hand, such as model discrimination, parameter estimation or prediction of future observations, with typical choices including expected information gain, uncertainty reduction, and many others (Ryan et al., 2016), as explained below. Importantly, BOED approaches require that theories are formalized via computational models of the underlying natural phenomena.

Computational models
Scientific theories are increasingly formalized via computational models (e.g., Guest and Martin, 2021). These models take many forms, but one desideratum for a model is that, given a set of experimental data, we can compare it to other models based on how well it predicts those data. To that end, many computational models are constructed in a way that allows for the analytic evaluation of the likelihood of the model given data, by means of a likelihood function. However, this imposes strong constraints on the space of models we can consider. Models that are rich enough to accurately describe complex phenomena, like human behavior, often have intractable likelihood functions (Cranmer et al., 2020). In other words, we cannot compute likelihoods due to the prohibitive computational cost of doing so, or the inability to express the likelihood function mathematically. This has at least two implications: (1) when we develop rich models, we often lack the tools to evaluate and compare them directly, and are left with coarse, qualitative proxy measures, like "the model curves look like the human curves for some parameter settings!"; (2) when we are intent on conducting a systematic comparison of models, we are often forced to use simplified models that have tractable likelihoods, even when the simplifications lead to consequential departures from the theories we might want our models to capture.
Simulator models: What if the likelihood cannot be computed?
Many realistically complex models have the feature that we can simulate data from them, which allows for using simulation-based inference methods, such as approximate Bayesian computation (ABC) and many others (Cranmer et al., 2020). There has been a growing interest in this class of models, often referred to as generative, implicit or simulator models (e.g., Cranmer et al., 2020; Brehmer et al., 2020; Sisson et al., 2007; Marjoram et al., 2003). We will use the term simulator model throughout this paper to refer to any model from which we can simulate data. We note that some generative models have tractable likelihood functions, but this class of models is subsumed by simulator models, which do not make any structural assumptions on the likelihood and include important cases where the likelihood is too expensive to be computed, or cannot be computed at all.¹ Simulator models have become ubiquitous in the sciences, including physics (Schafer and Freeman, 2012), biology (Ross et al., 2017), economics (Gourieroux et al., 2010), epidemiology (Corander et al., 2017) and cognitive science (Palestro et al., 2018), among others. The study of human, and animal, perception and cognition is experiencing a surge in the use of computational modeling (Griffiths, 2015). Here, simulator models can formalize complex theories about latent psychological processes to arrive at quantitative predictions about observable behavior. As such, simulator models appear across cognitive science, including, e.g., many Bayesian and connectionist models, a wide range of process models, cognitive architectures, many reinforcement learning models and physics engines (see e.g., Palestro et al., 2018; Gebhardt et al., 2020; Turner and Van Zandt, 2012; Turner and Sederberg, 2014; Bitzer et al., 2014; Turner et al., 2018; Bramley et al., 2018; Ullman et al., 2018; Kangasrääsiö et al., 2019). The applicability and usefulness of experimental design optimization for parameter estimation and model comparison has been demonstrated in many scientific disciplines, but this has often been restricted to settings with simple and tractable classes of models, for instance in cognitive science (e.g., Myung and Pitt, 2009; Zhang and Lee, 2010; Ouyang et al., 2018).

Figure 1: A high-level overview of our approach, which uses machine learning (ML) to optimize the design of experiments. The inputs to our method are a model from which we can simulate data and a prior distribution over the model parameters. Our method starts by drawing a set of samples from the prior and initializing an experimental design. These are used to simulate artificial data using the simulator model. We then use ML to estimate the expected information gain of those data, which is used as a metric to search over the design space. Finally, this is repeated until convergence of the expected information gain. The outputs of our method are optimal experimental designs, an estimate of the expected information gain when performing the experiment and an amortized posterior distribution, which can be used to cheaply compute approximate posterior distributions once real-world data are observed. Other useful by-products of our method include a set of approximate sufficient summary statistics and a fitted utility surface of the expected information gain.

An overview of this tutorial
In this tutorial, we present a flexible workflow that combines recent advances in machine learning (ML) and Bayesian optimal experimental design (BOED) in order to optimize experimental designs for any computational model from which we can simulate data. A high-level overview of the underlying ML-based BOED method is shown in Figure 1, including researchers' inputs and resulting outputs. The presented approach is applicable to any model from which we can simulate data and scales well to realistic numbers of design variables, experimental trials and blocks. Furthermore, in addition to optimized designs, the presented approach also yields tractable amortized posterior inference; that is, it allows researchers to use their actual experimental data to easily compute posterior distributions, which may otherwise be computationally expensive or intractable to compute.
Moreover, we note that, to our knowledge, there has been no study that performs BOED for simulator models without likelihoods and then uses the optimal designs to run a real experiment. Instead, most practical applications of BOED have been limited to simple scientific theories of nature (e.g., Liepe et al., 2013). This paucity of real-world applications is partly because the situations where BOED would be most useful are those where it has previously been computationally intractable to use. The ML-based approach presented in this work aims to overcome these limitations.
As with many new computational tools, however, the techniques for using BOED are not trivial to implement from scratch. In an effort to help other scientists understand these methods and use them to unlock the full potential of BOED in their own research, we next present a case study in using BOED to better understand human decision-making. Specifically, we (1) describe new simulator models that generalize earlier decision-making models that were constrained by the need to have tractable likelihoods; (2) walk through the steps to use BOED to design experiments comparing these simulator models and estimating the psychologically interpretable parameters they contain; and (3) contrast our new, optimized experiments to previous designs and illustrate their advantages and the new lessons they offer. All of our steps and analyses are accompanied by code that we have commented and structured to be easily adapted by other researchers.

Case study: Multi-armed bandit tasks
As a case study application, we consider the question of how people balance their pursuit of short-term reward ("exploitation") against learning how to maximize reward in the longer term ("exploration"). This question has been studied at length in psychology, neuroscience, and computer science, and one of the key frameworks for investigating it is the multi-armed bandit decision-making task (e.g., Steyvers et al., 2009; Dezfouli et al., 2019; Schulz et al., 2020; Gershman, 2018). Multi-armed bandits have a long history in statistics and machine learning (e.g., Robbins, 1952; Sutton and Barto, 2018) and formalize a general class of sequential decision problems: repeatedly choosing between a set of options under uncertainty with the goal of maximizing cumulative reward (for more background, see Box 1).
We demonstrate the applicability of our method by optimizing the design of a multi-armed bandit decision-making task, considering three flexible extensions of previously proposed models of human behavior. In our experiments, we consider two general scientific tasks: model discrimination and parameter estimation. We then evaluate the optimal designs found using our method empirically with simulations and real human behavioral data collected from online experiments, and compare them to designs commonly used in the literature. We find that, as compared to commonly-used designs, our optimal experimental designs yield significantly better model recovery, more informative posterior distributions and improved model parameter disentanglement.

Box 1. Bandit tasks
Bandit tasks constitute a minimal reinforcement learning task, in which the agent is faced with the problem of balancing exploration and exploitation. Various algorithms have been proposed for maximizing the reward in bandit tasks, and some of these algorithms have been used directly or in modified form as models of human behavior. A selection of such algorithms indeed captures important psychological mechanisms in solving bandit tasks (Steyvers et al., 2009). However, a richer account of human behavior in bandit tasks would seem to be required to accommodate the flexibility and nuance of human behavior (e.g., see Dezfouli et al., 2019; Lee et al., 2011), leading to simulator models that do not necessarily permit familiar likelihood-based inference, due to, e.g., unobserved latent states, and which complicate the task of intuiting informative experiments. See Box 2 for the specific models we consider in this work.
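To make the bandit setting concrete, the following minimal Python sketch simulates a single block of a Bernoulli multi-armed bandit task. The function names, the random agent and the reward probabilities are illustrative placeholders, not the models or settings used in our case study.

```python
import numpy as np

def run_bandit_block(choose_arm, reward_probs, n_trials=30, rng=None):
    """Simulate one block of a Bernoulli multi-armed bandit task.

    choose_arm:   callable taking the histories of choices and rewards and
                  returning the index of the next arm to pull.
    reward_probs: per-arm Bernoulli reward probabilities (the experimental design d).
    """
    rng = np.random.default_rng() if rng is None else rng
    choices, rewards = [], []
    for _ in range(n_trials):
        arm = choose_arm(choices, rewards)
        reward = int(rng.random() < reward_probs[arm])  # Bernoulli reward draw
        choices.append(arm)
        rewards.append(reward)
    return np.array(choices), np.array(rewards)

# Example: a purely random agent on a three-armed bandit.
rng = np.random.default_rng(0)
random_agent = lambda choices, rewards: int(rng.integers(3))
choices, rewards = run_bandit_block(random_agent, reward_probs=[0.2, 0.5, 0.8], rng=rng)
```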

Workflow Summary
In the following sections, we describe and walk through our workflow step-by-step, using the aforementioned multi-armed bandit task setting as a concrete case study to further illustrate ML-based BOED. At a high level, our workflow comprises the following steps:
1. Defining a scientific goal, e.g. which of a set of models best describes a natural phenomenon.
2. Formalizing a theory, or theories, as a computational model (or models) that can be sampled from.
3. Setting up the design optimization problem and deciding which aspects of the experimental design need to be optimized.
4. Constructing the required machine learning models and training them with simulated data.
5. Validating the obtained optimal designs in silico and, potentially, reevaluating design choices.
6. After confirming and validating the optimal design, the real experiment can be performed. The trained machine learning models used for BOED can be used afterwards to easily compute posterior distributions.
We provide a more detailed description of the above workflow in the following sections. The outputs of the ML-based BOED method are optimal experimental designs, an estimate of the (maximum) expected information gain and an amortized posterior distribution (see Materials and Methods), which allows for straightforward computation of posterior distributions from collections of real-world data and saves us a potentially computationally expensive (likelihood-free) inference step. Readers may find that many of these steps should be part of the design of experiments whether or not BOED is used. However, BOED provides the additional advantage of making all of these steps, which are sometimes carried out heuristically, explicit and formal.
Step 1: Formulate your scientific goal
The first step in setting up the experimental design optimization is to define our scientific goal. While there may be many different goals for an experiment, they often fall into the broad categories of model discrimination (MD) or parameter estimation (PE).

Model discrimination
Computational models provide formal and testable implementations of theories about nature. Model discrimination (also often referred to as model comparison) lies at the core of many scientific questions and amounts to the problem of deciding which of a set of different models best explains some observed data. A rigorous formulation of this problem is given by the Bayesian approach: starting from a prior belief about the plausibility of different models and their parameters, we consider how well different models can explain the observed data and use this information to update our prior beliefs. Crucially, more flexible theories are penalized automatically via the notion of the Bayesian Ockham's razor. This ensures that in cases where two theories can explain the data equally well, we should favor the simpler one. Formally, this requires computing the marginal likelihood of our different models, which means integrating (or, for discrete parameters, summing) over a potentially large number of model parameters. This is often intractable and is therefore frequently tackled via information criteria, which may, however, suffer from various problems (e.g., Pitt and Myung, 2002; Dziak et al., 2020; Schad et al., 2021). The computational burden of computing marginal model likelihoods is exacerbated when the model parameter likelihood itself is intractable, as is the case for many simulator models.
With respect to our case study, we may, for instance, be interested in which of the three simulator models presented in Box 2 best explains human behavior in multi-armed bandit tasks.

Parameter estimation
Researchers are often interested in learning about the model parameters of a particular, single model. Similar to the model discrimination approach, a Bayesian approach to learning about these parameters would involve starting with a prior distribution and then updating our belief via Bayes' rule using observed data. For instance, we may be interested in estimating an individual's (or population's) working memory capacity given a working memory model, the response time distribution in a model of selective attention, the probability of choosing between different options in a risky choice model, the inclination to explore versus exploit in a reinforcement learning task, or associations between biases in probabilistic visual integration tasks and autistic/schizotypal traits (Karvelis et al., 2018), to name a few examples. Model parameters may be continuous or discrete, which matters for how we set up the optimization procedure in Step 4.
As an example, for the AEG simulator model in our case study, we are, among many other things, interested in estimating people's inclination to reselect previously chosen options as opposed to avoiding repeating their own past behavior, which is characterized by a specific model parameter.

Other goals
While we here focus on the problems of model discrimination and parameter estimation, we point out that the overall BOED approach via mutual information estimation is more general, and can easily be adapted to other tasks, such as improving future predictions (e.g., Kleinegesse and Gutmann, 2021) and many others (see Ryan et al., 2016).

How do we measure the value of an experimental design?
We formalize the quality of our experimental design via a utility function, which serves as a quantitative way of measuring the value of one experimental design over any other and provides the objective function for our optimization problem. Following recent work in BOED (e.g., Kleinegesse and Gutmann, 2019, 2020; Kleinegesse et al., 2020; Kleinegesse and Gutmann, 2021; Foster et al., 2019), we use mutual information (MI; Lindley, 1956) (also known as the expected information gain) as our utility function U(d) for measuring the value of an experimental design d, i.e.

U(d) = I(v; y | d) = E_{p(v) p(y|v,d)} [ log p(v|y, d) / p(v) ],   (1)

where v is the variable we wish to estimate and y denotes the data observed at design d.
For the MD task, the variable of interest v in our utility function corresponds to a scalar model indicator variable m (where m = 1 corresponds to model 1, m = 2 to model 2, and so on for as many models as we are comparing). For the PE task, the variable of interest v in our utility function is the vector of model parameters θ_m for a given model m.
Mutual information has several appealing properties (Paninski, 2005) that make it a popular utility function in BOED (Ryan et al., 2016). Intuitively, MI quantifies the amount of information our experiment is expected to provide about the variable of interest. Furthermore, the MI utility function can easily be adapted to various scientific goals by changing the variable of interest (Kleinegesse and Gutmann, 2021). Optimal designs are then found by maximizing this utility function, i.e. d* = arg max_d U(d). Although a useful quantity, estimating U(d) is typically extremely difficult, especially for simulator models that have intractable likelihoods p(y|v, d) (e.g., Ryan et al., 2016). Nonetheless, in the following steps we shall explain how we can approximate this utility function using ML-based approaches.
Step 2: Cast your theory as a computational model
After defining our scientific goal, we need to ensure that our scientific theories are, or can be, formalized as computational models. This either requires us to formalize them from the ground up, or to leverage existing computational models from the literature that support our theories. In the former case, this entails making all aspects of the theory explicit, which is a highly useful and established practice in itself; we refer the reader to Wilson and Collins (2019) for more general guidance on building computational models. For the presented ML-based BOED approach, we require that our theories are cast as computational models that allow us to simulate data, i.e. simulator models. Further, we need to decide on a general experimental paradigm that can interact with our computational models.

Building a computational model
Formally, a computational model defines a generative model y ∼ p(y|θ, d) that allows sampling of synthetic data y, given values of its model parameters θ and the experimental design d. We assume that sampling from this model is feasible, but make no assumptions about whether evaluating the likelihood function p(y|θ, d) is tractable or intractable. Even if the likelihood is tractable and simple, important quantities such as the marginal likelihood may already be intractable, since evaluating the marginal likelihood requires integrating over all free parameters. This problem is naturally exacerbated for more complex models that may have intractable likelihood functions.
See Box 2 for a set of simulator models from our case study. These models describe human behavior in a multi-armed bandit setting, where the observed data y correspond to the sequences of chosen bandit arms along with the observed rewards.

What are our prior beliefs?
In addition to formalizing our theories as simulator models, we need to complete this step by defining prior distributions over the (free) model parameters, as would be required for any analysis. We point the reader to Wilson and Collins (2019); Schad et al. (2021); Mikkola et al. (2021) for guidance on specifying prior distributions, as this is a broad research topic and a detailed treatment is beyond the scope of this work. Generally, prior distributions should reflect what is known about the model parameters prior to running the experiment, either based on empirical data or general domain knowledge. This is crucial for the design of experiments, as we may otherwise spend experimental resources and effort on learning what is already known about our models.

Box 2. Models of human behavior in bandit tasks
We study models of human choice behavior in bandit tasks to demonstrate and validate our method. We generalize previously proposed models (see the SI for details), and consider three novel computational models for which likelihoods are intractable.
Win-Stay Lose-Thompson-Sample (WSLTS) Posits that people tend to repeat choices that resulted in a reward, and otherwise to explore by selecting options with a probability proportional to their posterior probability of being the best (Thompson sampling). This model generalizes the Win-Stay Lose-Shift model that has been considered in previous work (Steyvers et al., 2009), but accommodates the possibility that, after a loss, people might shift preferentially to more promising alternatives rather than uniformly at random.
Auto-regressive ε-Greedy (AEG) Formalizes the idea that people might greedily choose the option with the highest estimated reward in each trial with a certain probability or otherwise explore randomly, but also permits that people may be "sticky", i.e. having some tendency to reselect whatever option was chosen on the previous trial, or "anti-sticky", i.e. preferring to switch to a new option on each trial.This generalizes the ε-greedy model studied previously (Steyvers et al., 2009), further accommodating the possibility that people preferentially stick to or avoid their previous selection.
Generalized Latent State (GLS) Captures the idea that people might have an internal latent exploration-exploitation state that influences their choices when encountering situations in which they are faced with an explore-exploit dilemma. Transitions between these states may depend on previously encountered rewards and the previous latent state. This generalizes previous latent-state models that assume people shift between states independently of previous states and observed rewards (Zhang et al., 2009), or only once per task (Lee et al., 2011). As these models are novel and to illustrate the method, we do not specify strong prior beliefs about any model parameters; we describe the prior distributions in the SI.
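To illustrate how such a theory becomes a simulator model, the sketch below implements a deliberately simplified ε-greedy agent with a stickiness parameter, in the spirit of the AEG model. The exact parameterization used in our case study is described in the SI; the parameter names (epsilon, stickiness), the reward-estimation scheme and the restriction to stickiness in [0, 1) are simplifying assumptions made here for illustration (the full model also allows "anti-sticky" behavior).

```python
import numpy as np

def simulate_aeg_block(epsilon, stickiness, reward_probs, n_trials=30, rng=None):
    """Simplified AEG-style simulator: with probability epsilon explore uniformly,
    otherwise choose greedily; shift probability mass towards the previous choice."""
    rng = np.random.default_rng() if rng is None else rng
    n_arms = len(reward_probs)
    pulls = np.ones(n_arms)   # pseudo-counts of pulls per arm
    wins = np.ones(n_arms)    # pseudo-counts of rewards per arm
    prev_arm = None
    choices, rewards = [], []
    for _ in range(n_trials):
        if rng.random() < epsilon:
            probs = np.full(n_arms, 1.0 / n_arms)          # explore uniformly at random
        else:
            estimates = wins / pulls
            probs = (estimates == estimates.max()).astype(float)
            probs /= probs.sum()                           # greedy over estimated reward
        if prev_arm is not None:                           # stickiness: bias towards previous arm
            probs = (1.0 - stickiness) * probs
            probs[prev_arm] += stickiness
        arm = int(rng.choice(n_arms, p=probs))
        reward = int(rng.random() < reward_probs[arm])
        pulls[arm] += 1
        wins[arm] += reward
        prev_arm = arm
        choices.append(arm)
        rewards.append(reward)
    return np.array(choices), np.array(rewards)
```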
Step 3: Decide which aspect of your experiment to optimize
Having formalized our scientific theories as simulator models and decided on our scientific goals, we need to think about which parts of the experiment we want to optimize. In principle, any controllable parameter of the experimental apparatus could be tuned as part of the experimental design optimization. Concretely, in order to solve an experimental design problem by means of BOED, our experimental designs d need to be formalized as part of the simulator model p(y|θ, d). Instead of choosing all controllable parameters as experimental designs, it may help to leverage domain knowledge and prior work to choose a sufficient set of experimental design variables, in order to reduce the dimensionality of the design problem.
In our case study, for instance, we have chosen the reward probabilities of different bandit arms to be the experimental designs to optimize over.This choice posits that certain reward environments yield more informative data than others, in the context of model discrimination and parameter estimation.Our task is then to find the optimal reward environment that yields maximally informative data.
There may be other experimental choices that we will have to make, in particular to limit variability in the observed data, to control for confounding effects, and to reduce non-stationarity, among many other reasons. In our multi-armed bandit setting, for example, we may choose to limit the number of trials in the experiment to avoid participant fatigue, which is usually not accounted for in computational models of behavioral phenomena. Additionally, we may limit the number of available choices in a discrete choice task and any other aspect where domain knowledge suggests reasonable constraints, or where we want to ensure comparability with prior research.

Static and adaptive designs
We consider the common case of static Bayesian optimal experimental design, with the possibility of having multiple blocks of trials. That is, we find a set of optimal designs prior to actually performing the experiment. In some cases, however, a scientist may wish to optimize the designs sequentially as the participant progresses through the experiment. This is called sequential, or adaptive, BOED (see Ryan et al., 2016) and has been explored previously for simple tractable models of human behavior (Myung et al., 2013), and in the domain of adaptive testing (e.g., Weiss and Kingsbury, 1984). More recently, sequential BOED has also been extended to deal with implicit models (Kleinegesse et al., 2020; Ivanova et al., 2021). At this point, sequential BOED is an active area of research and may be a consideration for researchers in the future. Generally, sequential designs can be more efficient, as they allow the experiment to be adapted "on-the-fly". However, sequential designs also make building the actual experiment more complicated, because the experiment needs to be updated in almost real-time to provide a seamless experience. Further, in some settings participants may notice that the experiment adapts to their actions, which may inadvertently alter their behavior. Which approach to follow is ultimately a choice that depends on the experiment and research questions at hand, while the overall workflow outlined in this tutorial is relevant to both.

Box 3. Experimental settings in the case study
Behavioral experiments are often set up with several experimental blocks. We here make the common assumption of exchangeability with respect to the experimental blocks in our analyses and randomize block order in all experiments involving human participants. In our experiments, the bandit task in each experimental block has three choice options (bandit arms) and 30 trials (sequential choices). For each block of trials, the data y consist of 30 actions and 30 rewards, leading to a dimensionality of 60 per block. For a given block, the experimental design d we wish to optimize comprises the reward probabilities of the multi-armed bandit, which, in our setting, are three-dimensional (corresponding to independent scalar Bernoulli reward probabilities for each bandit arm). The generative process for models of human behavior in bandit tasks is illustrated in Box 2.
We demonstrate the optimization of reward probabilities for multi-armed bandit tasks, with the scientific goals of model discrimination (MD) and parameter estimation (PE). We consider five blocks of 30 trials per participant, yielding a fairly short but still informative experiment. We divide this into two blocks for MD followed by three blocks for PE, where the PE blocks are conditional on the winning model for the participant in the first two blocks. In other words, for the PE task, a different experimental design (which corresponds to a particular setting of the reward probabilities associated with each bandit arm) is presented to a participant depending on the simulator model that best described them in the first two blocks. Since we consider three-armed bandits, the design space is constituted by the possible combinations of reward probabilities. This is 6-dimensional for MD and 9-dimensional for PE. Similarly, the data for a single block is 60-dimensional, so the dimensionality of the data is 120 for MD and 180 for PE. Our method applies to any prior distribution over model parameters or the model indicator, but for demonstration purposes we assume uninformative prior distributions on all model parameters (see SI for details). As our baseline designs, we sample all reward probabilities from a Beta(2, 2) distribution, following a large and well-known experiment on bandit problems with 451 participants (Steyvers et al., 2009).
Step 4: Use machine learning to design the experiments
As previously explained, mutual information (MI), shown in Equation 1, is an effective means of measuring the value (i.e., the utility) of an experimental design setting d in BOED (Paninski, 2005; Ryan et al., 2016). Unfortunately, it is generally an expensive, or intractable, quantity to compute, especially for simulator models. The reason for this is two-fold: firstly, the MI is defined via a high-dimensional integral. Secondly, the MI requires density evaluations of either the posterior distribution or the marginal likelihood, both of which are expensive to compute for general statistical models, and intractable for simulator models. Recent advances in BOED for simulator models thus advocate the use of cheaper MI lower bounds instead (e.g., Foster et al., 2019; Kleinegesse and Gutmann, 2020; Kleinegesse and Gutmann, 2021; Ivanova et al., 2021).

Approximating mutual information with machine learning
In this work, we leverage the MINEBED method (Kleinegesse and Gutmann, 2020), which allows us to efficiently estimate the MI using machine learning. Specifically, for a particular design d, we first simulate data from the computational models under consideration (shown in Box 2). The approach works by constructing a lower bound on the MI that is parameterized via a machine learning model. In our case study we use a neural network with an architecture customized for behavioral experiments (shown in Box 4), but any other differentiable machine learning model could be used in its place. We then train the machine learning model by using the simulated data as input and the lower bound as an objective function, which gradually tightens the lower bound. More concretely, this method works by training a neural network T_ψ(v, y), or any other machine learning model, using stochastic gradient ascent, where ψ are the neural network parameters and the data y are simulated at design d with samples from the prior p(v). We train this neural network by maximizing the following objective function, which is a lower bound of the mutual information,

U_NWJ(d; ψ) = E_{p(v, y|d)} [ T_ψ(v, y) ] − e^{−1} E_{p(v) p(y|d)} [ exp(T_ψ(v, y)) ].   (2)

The above lower bound is also known as the NWJ lower bound and has several appealing bias-variance properties (Poole et al., 2019). The expectations in Equation 2 are usually approximated using (Monte-Carlo) sample averages.

Once we have trained the neural network T_ψ(v, y) using stochastic gradient ascent on ψ, we estimate the mutual information at d by computing Equation 2 using a held-out test set of data simulated from the computational model.
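The sketch below shows a minimal PyTorch version of this training loop for a single design d. The network size, optimizer settings, and the use of within-batch shuffling to form samples from the product of marginals are illustrative assumptions rather than the exact MINEBED configuration used in the case study.

```python
import torch
import torch.nn as nn

def nwj_lower_bound(t_joint, t_marginal):
    """NWJ lower bound on the mutual information (Equation 2).
    t_joint:    T_psi(v, y) on paired samples (v, y) ~ p(v) p(y | v, d).
    t_marginal: T_psi(v, y) on shuffled pairs, approximating p(v) p(y | d)."""
    return t_joint.mean() - torch.exp(t_marginal - 1.0).mean()

def train_critic(v, y, n_epochs=500, lr=1e-3, hidden=64):
    """Train T_psi on simulated (v, y) pairs for a fixed design d."""
    critic = nn.Sequential(
        nn.Linear(v.shape[1] + y.shape[1], hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )
    optimizer = torch.optim.Adam(critic.parameters(), lr=lr)
    for _ in range(n_epochs):
        perm = torch.randperm(y.shape[0])                    # break the (v, y) pairing
        t_joint = critic(torch.cat([v, y], dim=1))
        t_marginal = critic(torch.cat([v, y[perm]], dim=1))
        loss = -nwj_lower_bound(t_joint, t_marginal)         # maximize the lower bound
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return critic
```

Evaluating nwj_lower_bound on a held-out set of simulated pairs then gives the MI estimate at the design d.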

Dealing with high-dimensional observations
We can efficiently deal with high-dimensional data and do not have to construct (approximately sufficient) summary statistics based on domain expertise, as is commonly done (Myung and Pitt, 2009; Zhang and Lee, 2010; Ouyang et al., 2018), and which can be prohibitively difficult for complex simulator models (Chen et al., 2021). In fact, the previously described method allows us to automatically learn summary statistics as a by-product, as explained below.
Critical to the performance of our method in our case study, as with any application of neural networks, is the choice of architecture (see Elsken et al. (2019) and Ren et al. (2021) for recent reviews of the challenges and solutions in neural architecture search). In order to develop an effective neural network architecture, we leverage the method of neural approximate sufficient statistics (Chen et al., 2021) and propose it as a novel and practical alternative to hand-crafted summary statistics in the context of BOED. Behavioral data are often collected in several blocks that can have different designs but should be analyzed jointly, which can be challenging due to the high dimensionality of the joint data, which typically comprise numerous choices, judgments or other measurements (including high-resolution and/or high-frequency neuroimaging data). In order to deal with a multi-block data structure effectively, we propose an architecture devised specifically for application to behavioral experiments. Our architecture, visualized in Figure 3, incorporates a sub-network for each block of an experiment. The outputs of these sub-networks are then concatenated and passed as an input to a larger neural network. In doing so, besides reducing the dimensionality of the data, each sub-network conveniently learns to approximate sufficient summary statistics of the data from each block in the experiment (Chen et al., 2021).
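One possible PyTorch rendering of this multi-block idea is sketched below: a small sub-network per experimental block compresses that block's data into a handful of learned summary statistics, which are concatenated with the variable of interest and passed to a head network. The layer sizes and the number of summary statistics per block are illustrative choices, not the exact architecture of Figure 3.

```python
import torch
import torch.nn as nn

class MultiBlockCritic(nn.Module):
    """T_psi(v, y) with one sub-network per experimental block.

    Each sub-network compresses the data of its block into a few learned
    summary statistics; the head network combines these summaries with the
    variable of interest v."""

    def __init__(self, block_dims, v_dim, n_summaries=4, hidden=64):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                          nn.Linear(hidden, n_summaries))
            for d in block_dims
        ])
        head_in = n_summaries * len(block_dims) + v_dim
        self.head = nn.Sequential(nn.Linear(head_in, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def forward(self, v, block_data):
        # block_data: list of tensors, one per block, each of shape (batch, block_dim)
        summaries = [net(y_b) for net, y_b in zip(self.blocks, block_data)]
        return self.head(torch.cat(summaries + [v], dim=1))
```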
In this work we develop a bespoke feed-forward neural network architecture, which allows us to efficiently deal with multiple blocks of data. However, the overall approach is not restricted to this particular machine learning method; in fact, any sufficiently flexible function approximator may be used. In some settings, researchers may be interested in using machine learning models that require little tuning, such as tree-based ensemble methods like Random Forests (Breiman, 2001) or gradient boosted trees (Friedman, 2001). On the other hand, when dealing with high-dimensional but structured data, such as in eye-tracking or neuroimaging (or combinations of different types of data, see Turner et al., 2017), one may use other appropriate architectures, such as convolutional (LeCun et al., 2015), recurrent (Rumelhart et al., 1986) or graph neural networks (Zhou et al., 2020).

Search over the space of possible experimental designs
Once we have obtained an estimate of U(d) at a candidate design d, we need to update the design appropriately when searching the design space to solve the experimental design problem d* = arg max_d U(d). Unfortunately, in the setting of discrete data y (such as choices between a set of options), gradients with respect to designs are generally unavailable, requiring the use of gradient-free optimization techniques. We here optimize the design d by means of Bayesian optimization (BO; see Shahriari et al., 2015, for a review), which has been used effectively in the experimental design context before (e.g., Martinez-Cantin, 2014; Kleinegesse and Gutmann, 2020). We specifically use a Gaussian Process (GP) as our probabilistic surrogate model with a Matérn-5/2 kernel and Expected Improvement as the acquisition function (these are standard choices). A formal summary of our BOED approach is shown in the SI.
For higher-dimensional continuous design spaces, this search may become difficult and necessitate the use of more scalable BO variants (e.g., Overstall and Woods, 2017; Oh et al., 2018). For discrete design spaces, or combinations of discrete and continuous design spaces, one can straightforwardly parallelize the optimization for each level of the discrete variable. If the observed data are continuous, such as participants' response times, and one is able to compute gradients with respect to the designs (i.e., the simulator is differentiable), the design space may also be explored with gradient-based optimization (Kleinegesse and Gutmann, 2021).
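An off-the-shelf Bayesian optimization library can run this outer search; the sketch below uses gp_minimize from scikit-optimize with an Expected Improvement acquisition function. The helper estimate_mi_at_design stands in for the simulate-and-train procedure from Step 4 and is replaced here by a toy function so the sketch runs end to end; the design bounds and number of evaluations are likewise illustrative.

```python
from skopt import gp_minimize
from skopt.space import Real

def estimate_mi_at_design(design):
    """Stand-in for Step 4: simulate data at `design`, train the critic network and
    return the NWJ mutual-information estimate. Replaced by a toy surrogate here."""
    return -sum((p - 0.5) ** 2 for p in design)

# One Bernoulli reward probability per bandit arm, each constrained to [0, 1].
design_space = [Real(0.0, 1.0, name=f"arm_{i}") for i in range(3)]

result = gp_minimize(
    func=lambda d: -estimate_mi_at_design(d),  # gp_minimize minimizes, so negate the MI
    dimensions=design_space,
    acq_func="EI",                             # Expected Improvement acquisition
    n_calls=30,                                # number of design evaluations
    random_state=0,
)
optimal_design = result.x                      # design with the highest estimated MI
```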

Where does machine learning come in?
Machine learning is currently revolutionizing large parts of the natural sciences, with applications ranging from understanding the spread of viruses (Currie et al., 2020) to discovering new molecules (Jumper et al., 2021), forecasting climate change (Runge et al., 2019) and developing new theories of human decision-making (Peterson et al., 2021). Due to its versatility, machine learning can be integrated effectively into a wide range of scientific workflows, facilitating theory building, data modeling and analysis. In particular, machine learning allows us to automatically discover patterns in high-dimensional data sets. Many recent state-of-the-art methods in natural science therefore deeply integrate machine learning into their data analysis pipelines (e.g., Blei and Smyth, 2017). However, there has been little focus on applying machine learning to improve the data collection process, which ultimately determines the quality of data, and thus the efficacy of downstream analyses and inferences. In this work, we use machine learning methods as part of our optimization scheme, where they serve as flexible function approximators.

Box 5. Model discrimination: Simulation study
Our method reveals that the optimal reward probabilities for the model discrimination task are, approximately, [0, 0, 0.6] for one block of trials and [1, 1, 0] for the other block. These optimal designs stand in stark contrast to the usual reward probabilities used in such behavioral experiments, which characteristically use less extreme values (Steyvers et al., 2009). Contrary to our initial intuitions, and, presumably, those of previous experiment designers, the extreme probabilities in these optimal designs are effective at distinguishing between particular theoretical commitments of our different models. For instance, the AEG model (see Box 2) can break ties between options with equivalent (observed) reward rates via the "stickiness" parameter. To illustrate this effect, consider the bandit with [1, 1, 0] reward probabilities, as found above. Here, switching between the two options that always produce a reward is different from switching to the option that never produces a reward. Specifically, switching between the winning options can be seen as "anti-stickiness", motivated, e.g., by boredom or epistemic curiosity about subtle differences in the true reward rates of the winning options. This type of strategy is less apparent when the bandit arms are stochastic. As such, these mechanisms can be investigated most effectively by not introducing additional variability due to stochastic rewards and instead isolating their effects. Hence, we can interpret the, perhaps counter-intuitive, optimal experimental designs as effective choices for isolating distinctive mechanisms of our behavioral models or their parameters.
Using the neural network that was trained at that optimal design, we can cheaply compute approximate posterior distributions of the model indicator (as described in Materials and Methods). This yields the confusion matrices in Figure 4, which show that our optimal designs result in considerably better model recovery than the baseline designs.

Step 5: Validate the optimal experimental design in silico
Once we have converged in our search over experimental designs, we can simulate the behavior of our theories under our optimal experimental design(s).
Irrespective of the final experiment that will be run, model simulations can already provide useful information for theory building, e.g., by revealing that the simulator models generate implausible data for certain experimental designs. This means that we can use simulations to assess whether we can recover our models, or individual model parameters, given the experimental design, as we did in our simulation study. We highlight this as a critical step, since issues at this stage warrant revising the models under consideration; importantly, running the corresponding real-world experiment regardless would be expected to yield uninformative data and waste resources (Wilson and Collins, 2019). We have illustrated this step using our case study in Box 5 and Box 6, where we present simulation study results for model discrimination and parameter estimation, respectively.
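In code, the in-silico check for model discrimination might look like the loop sketched below: simulate synthetic participants from each candidate model at the optimized design, infer the model with the trained amortized posterior, and tabulate a confusion matrix (as in Figure 4). Both simulate_model and posterior_model_probs are hypothetical placeholders for the simulator and the trained networks.

```python
import numpy as np

def model_recovery_confusion(simulate_model, posterior_model_probs,
                             design, n_models=3, n_sims=500, rng=None):
    """Rows: true (simulating) model; columns: model recovered as the MAP estimate."""
    rng = np.random.default_rng() if rng is None else rng
    confusion = np.zeros((n_models, n_models))
    for true_m in range(n_models):
        for _ in range(n_sims):
            y = simulate_model(true_m, design, rng)       # one synthetic participant
            probs = posterior_model_probs(y, design)      # amortized posterior over m
            confusion[true_m, int(np.argmax(probs))] += 1
    return confusion / n_sims                             # rows sum to 1: recovery rates
```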

Box 6. Parameter estimation: Simulation study
We next discuss the parameter estimation results, focusing on the WSLTS model; the results for the AEG and GLS models can be found in the SI. We find that the optimal reward probabilities for the WSLTS model are [0, 1, 0], [0, 1, 1] and [1, 0, 1] for the three blocks. Similar to the model discrimination task, these optimal designs take extreme values, unlike commonly-used reward probabilities in the literature. In Figure 5 we show posterior distributions of the WSLTS model parameters for optimal and baseline designs. We find that our optimal designs yield data that result in considerably improved parameter recovery, as compared to baseline designs.

Step 6: Run the real experiment
Once we are satisfied with the optimal experimental design we have found, we can run the actual experiment. Here, the ML models we used to approximate the MI allow us to cheaply perform posterior inference: after running the real-world experiment, we can use these trained models to compute approximate posterior distributions over models and their parameters. This procedure provides a valuable addition to a Bayesian modeling workflow (Schad et al., 2021; Wilson and Collins, 2019), as computing posterior distributions is otherwise a computationally expensive step, which is especially difficult for models with intractable likelihoods. We explain the theoretical details below, and showcase an analysis of a real experiment with our case study in Boxes 7−9.

Posterior Estimation
Once the neural network is trained, we can conveniently compute an approximate posterior distribution with a single forward pass (see Kleinegesse and Gutmann, 2020, for a derivation and further explanations). The NWJ lower bound is tight when T_ψ*(v, y) = 1 + log p(v|y, d*)/p(v). By rearranging this, we can thus use our trained neural network T_ψ*(v, y) to compute a (normalized) estimate of the posterior distribution,

p(v|y, d*) ≈ p(v) exp( T_ψ*(v, y) − 1 ).   (3)

Neural networks are notorious for large variations in performance, mainly due to the random initialization of network parameters. The quality of the posterior distribution estimated using Equation 3 is therefore quite sensitive to the initialization of the neural network parameters. To overcome this limitation and to obtain a more robust estimate of the posterior distribution, we suggest using ensemble learning to compute Equation 3. This is done by training several neural networks (50 in our case study) and then averaging the posterior estimates obtained from each trained neural network.
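Assuming critics trained as in Step 4, turning them into an approximate posterior is a single forward pass per network. The sketch below shows this for a discrete variable of interest (such as the model indicator), including the ensemble average and a renormalization step; the helper names and tensor shapes are illustrative.

```python
import torch

def amortized_posterior(critics, v_grid, y, prior):
    """Approximate p(v | y, d*) from an ensemble of trained critics (Equation 3).

    critics: list of trained networks T_psi*(v, y).
    v_grid:  candidate values of v, shape (n_values, v_dim).
    y:       one observed data vector, shape (y_dim,).
    prior:   prior probabilities p(v) over v_grid, shape (n_values,).
    """
    y_rep = y.unsqueeze(0).expand(v_grid.shape[0], -1)     # pair y with every candidate v
    posteriors = []
    for critic in critics:
        t = critic(torch.cat([v_grid, y_rep], dim=1)).squeeze(-1)
        post = prior * torch.exp(t - 1.0)                  # p(v) exp(T - 1), Equation 3
        posteriors.append(post / post.sum())               # renormalize for comparability
    return torch.stack(posteriors).mean(dim=0)             # ensemble average
```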

Box 7. Human participant study: Methodology
To validate our method on empirical data, we collected a sample of N = 326 participants from Prolific (www.prolific.co).ᵃ Participants were randomly allocated to our optimal designs or to one of ten baseline designs (the same as in the simulation study); we refer to the former as the optimal design group and to the latter as the baseline group. The first two blocks correspond to the model discrimination (MD) task, i.e. they were used to identify the model that matches a participant's behavior best. This was done by computing the posterior distribution over the model indicator m using the trained neural networks (see the appendix) and then selecting the model with the highest posterior probability, i.e. the maximum a posteriori estimate. This inference process was done online without a noticeable interruption to the experiment. The following three blocks were then used for the parameter estimation (PE) task. Participants in the optimal design group were allocated the optimal design according to their inferred model (as provided by the MD task). Participants in the baseline group were again allocated to baseline designs. This data collection process facilitates gathering real-world data for both the MD and PE tasks, allowing us to effectively compare our optimal designs to baseline designs. See the appendix for more details about the experimental setup.
Here, we follow the rationale that more useful experimental designs are those that lead to more information (or lower posterior uncertainty) about the variable of interest. The belief about our variable of interest is provided by its posterior distribution, which we can estimate using the neural network output. We assess the quality of these posterior distributions by estimating their entropy, which quantifies the amount of information they encode (Shannon, 1948). This is an effective and prominent metric for evaluating distributions, as it directly relates to the uncertainty about the variable of interest. For the model discrimination task, we specifically consider the Shannon entropy as a metric, since the model indicator m is a discrete random variable. As the continuous analogue of Shannon entropy, we use differential entropy for measuring the entropy of posterior distributions in the parameter estimation task, since the model parameters are continuous random variables. This allows us to quantitatively evaluate our optimal designs and compare their efficacy to that of baseline designs (a minimal code sketch of these entropy computations follows after this box).
ᵃ All experiments were certified according to the University of Edinburgh Informatics Research Ethics Process, RT number 2019/58792.
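The two entropy measures referred to in Box 7 could be computed as in the following sketch: Shannon entropy directly from the posterior probabilities of the model indicator, and differential entropy via a multivariate Gaussian approximation fitted to posterior samples of the parameters. The Gaussian approximation is our simplifying assumption for illustration, not necessarily the estimator used in the study.

```python
import numpy as np

def shannon_entropy(model_probs):
    """Shannon entropy (in nats) of a discrete posterior over the model indicator."""
    p = np.asarray(model_probs)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def gaussian_differential_entropy(posterior_samples):
    """Differential entropy of a Gaussian fitted to posterior samples
    (rows: samples, columns: model parameters)."""
    cov = np.atleast_2d(np.cov(posterior_samples, rowvar=False))
    k = posterior_samples.shape[1]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (k * np.log(2 * np.pi * np.e) + logdet)
```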

Summary statistics
Additionally, as another component of our approach, we also obtain automatically learned (approximate) sufficient summary statistics for data observed with the optimal experimental designs. Summary statistics are often crafted manually by domain experts, which tends to be a time-consuming and difficult task. In fact, hand-crafted summary statistics are becoming less effective and informative as our models naturally become more complex as science advances (Chen et al., 2021). Our learned summary statistics could be used for various downstream tasks, e.g., as an input for approximate inference techniques (such as ABC), or as a means to interpret and analyze the data more effectively, e.g., as a lower-dimensional representation used to visualize the data.

Box 8. Human participant study: Results
We here briefly present the results of our human participant study. For the model discrimination (MD) task, we find that our optimal designs yield posterior distributions that tend to have lower Shannon entropy than the baseline designs, as shown in Figure 6. Our ML-based approach thus allows us to obtain more informative beliefs about which model matches real individual human behavior best, as compared to baselines.
Having determined the best model to explain a given participant's behavior, we turn to the parameter estimation (PE) task, focusing our efforts on estimating the model parameters that explain their specific strategy. Similar to the MD task, we here assess the quality of joint posterior distributions using entropy, where smaller values directly correspond to smaller uncertainty. Figure 7 shows the distribution of differential posterior entropies across participants that were assigned to the AEG simulator model; we show corresponding plots for the WSLTS and GLS models in the appendix. Similar to the MD task, we find that our optimal designs provide more informative data, i.e. they result in joint posterior distributions that tend to have lower differential entropy than the ones resulting from baseline designs, as shown in the figure.

Box 9. Exploring model parameter disentanglement
In addition to measuring the efficacy of our optimal experiments via the differential entropy of posterior distributions, we can ask how well they isolate, or disentangle, the effects of individual model parameters. Specifically, we measure this by means of the (Pearson) correlation between model parameters in the joint posterior distribution, which we present in Figure 8 for all simulator models and for both optimal designs and baseline designs.
Here, a correlation of r = 0 means that there is no (linear) relationship between the respective pair of inferred model parameters across individuals, thereby providing evidence that the data under this design allow us to separate the natural mechanisms corresponding to those parameters. On the other hand, correlations closer to r = −1 or r = 1 would imply that the model parameters essentially encode the same mechanism. As shown descriptively in Figure 8, our optimal designs are indeed able to isolate the mechanisms of individual model parameters more effectively than baseline designs. To assess this observation quantitatively, we first z-transform (Fisher, 1915) the correlation coefficients (i.e., all entries below the main diagonal in the correlation matrices) and compute the average absolute z-score over the model parameters within each participant. The resulting average absolute z-score expresses the average magnitude of linear dependence between the posterior model parameters for each participant. Next, we assess whether the average absolute z-scores differ significantly between those participants allocated to our optimal designs and those allocated to the baseline designs. We compute this statistical significance by conducting two-sided Welch's t-tests (Welch, 1947). We find that our optimal designs significantly outperform baseline designs for all three models: for the WSLTS model we find p < 0.001 (t(60.12) = 13.89), for the AEG model p = 0.004 (t(27.30) = 3.12), and for the GLS model p < 0.001 (t(91.01) = 8.44). This provides additional evidence that our optimal designs are better able to disentangle the effects of individual model parameters than baseline designs. As explained earlier, we argue that this isolation of effects is facilitated by the extreme values of our optimal designs.
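This analysis maps onto standard NumPy/SciPy operations, as sketched below: np.arctanh implements the Fisher z-transform and scipy.stats.ttest_ind with equal_var=False gives Welch's t-test. The assumed input, one posterior correlation matrix per participant, is an illustrative simplification of how the joint posteriors are summarized.

```python
import numpy as np
from scipy import stats

def mean_abs_fisher_z(corr_matrix):
    """Average absolute Fisher z-score of the below-diagonal correlations."""
    rows, cols = np.tril_indices_from(corr_matrix, k=-1)
    r = np.clip(corr_matrix[rows, cols], -0.999, 0.999)   # avoid infinities at |r| = 1
    return float(np.abs(np.arctanh(r)).mean())

def compare_designs(corrs_optimal, corrs_baseline):
    """Two-sided Welch's t-test on per-participant average absolute z-scores."""
    z_opt = [mean_abs_fisher_z(c) for c in corrs_optimal]
    z_base = [mean_abs_fisher_z(c) for c in corrs_baseline]
    return stats.ttest_ind(z_opt, z_base, equal_var=False)
```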

Possible Caveats
The process of making all of our assumptions explicit and leaving the task of proposing concrete designs to the optimization procedure provides assurance that our design is optimal with respect to our assumptions. Thereby, this process may also reduce bias in the design of experiments, as the optimization process can easily be replicated. We believe that these properties will provide value towards tackling the notorious replication crisis (Ioannidis, 2005; Pashler and Wagenmakers, 2012; Baker, 2016) and potentially setting new standards. Meanwhile, it is important to point out that the designs found using our optimization procedure are not guaranteed to be optimal for all scenarios. That is, there is no "free lunch" (Wolpert, 1996): no single design suits all situations, as the usefulness of experimental designs depends on the theories under consideration, prior beliefs and the utility function. For instance, if we already have strong prior beliefs about a particular model parameter (such as people being strongly inclined to explore instead of exploit), there may be little to learn about this parameter and the BOED procedure may instead focus on parameters about which we have weaker prior beliefs, which may yield different optimal designs.

Conclusion and outlook
This tutorial provides a flexible step-by-step workflow for finding optimal experimental designs by leveraging recent advances in machine learning (ML) and Bayesian optimal experimental design (BOED). As a first step, the proposed workflow suggests defining a scientific goal, such as model discrimination or parameter estimation, thereby expressing the value of a particular design as a measurable quantity via a utility function. Secondly, the scientific theories under investigation need to be cast as a computational model from which we can sample synthetic data. The methodologies discussed in this tutorial make no assumption, however, about whether the likelihood function of the computational model is tractable, which considerably widens the space of scientific theories that can be tested. In the third step, the design optimization problem needs to be set up, which includes deciding which experimental design variables should be optimized. The fourth step of our proposed workflow then involves using machine learning to estimate and optimize the utility function.
In doing so, we discussed a recent approach that leverages neural networks to learn a mapping between experimental designs and expected information gain, i.e. mutual information, and provided a novel extension that efficiently deals with behavioral data. In the fifth step, the discovered optimal designs are validated in silico using synthetic data. This ensures that the optimal designs are sound and that no experimental resources are wasted, for example because it turns out at this step that the computational models themselves need to be revisited. Lastly, in the sixth step, the real experiment can be performed using the discovered optimal design. Using the discussed methodologies, the trained neural networks can then conveniently be used to compute posterior distributions and summary statistics immediately after data collection. This tutorial and the accompanying case study demonstrate the usefulness of modern ML and BOED methods for improving the way in which we design experiments and thus collect empirical data. Furthermore, the methodologies discussed in this tutorial optimize experimental designs for models of cognition without requiring computable likelihoods or marginal likelihoods. This is critical to provide the methodological support required for scientific theories and models of human behavior that are becoming increasingly realistic and complex. As part of our case study, using simulations and real-world data, we showed that the proposed workflow and the discussed methodologies yield optimal designs that outperform human-crafted designs found in the literature, offering more informative data for model discrimination as well as parameter estimation. We showed that adopting more powerful methodological tools allows us to study more realistic theories of human behavior, and our results provide empirical support for the expressiveness of the simulator models we studied in our case study. More broadly, machine learning has seen success in behavioral research regarding the analysis of large datasets and the discovery of new theories (e.g., Peterson et al., 2021), but its potential for data collection, and in particular experimental design, has been explored remarkably little. We view the present work as an important step in this direction.

… arm to select (see Equation 4). Conversely, θ_1 corresponds to the probability of switching to another arm when observing a loss in the previous trial, i.e. when r^(t−1) = 0, in which case the agent again performs Thompson Sampling to select another bandit arm. However, the agent may also re-select the previous bandit arm with probability 1 − θ_1 even after having observed a loss.
During the exploration phases mentioned above, the WSLTS agent performs Thompson Sampling from a reshaped posterior, which is controlled via a temperature parameter θ_2. The corresponding bandit arm is then chosen by sampling from this reshaped posterior over each arm's reward probability, where α_k and β_k correspond to the number of times a participant observed a reward of 1 and 0, respectively, for bandit arm k. This stands in contrast to standard WSLS, where the agent shifts to another arm uniformly at random. We note that for θ_2 values close to 1 we recover standard Thompson Sampling (excluding the previous arm), while for θ_2 → ∞ we recover standard WSLS, and for θ_2 → 0 we obtain a greedy policy. We provide pseudo-code for the WSLTS computational model in Algorithm 1.
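As a rough illustration, the following Python sketch implements one plausible form of this tempered Thompson Sampling step. The specific reshaping of the Beta posterior (raising its density to the power 1/θ_2, implemented below by rescaling the Beta parameters) is an assumption made for the sketch; it matches the limiting behavior described above (θ_2 = 1 recovers standard Thompson Sampling, θ_2 → ∞ approaches uniform shifting, θ_2 → 0 approaches a greedy choice), but the exact parameterization in the paper's Algorithm 1 may differ.

```python
import numpy as np

def wslts_explore_choice(alpha, beta, theta_2, previous_arm, rng):
    """One exploration step of a WSLTS-style agent: Thompson Sampling from a
    temperature-reshaped Beta posterior, excluding the previously chosen arm.

    alpha, beta: per-arm reward/failure counts (plus prior pseudo-counts).
    theta_2: temperature; 1 ~ standard Thompson Sampling, large values ~ near-uniform
             (WSLS-like) shifting, values near 0 ~ greedy selection.
    """
    # Reshape the Beta posterior by raising its density to the power 1 / theta_2
    # (an assumption for this sketch), which rescales the Beta parameters.
    alpha_shaped = (alpha - 1.0) / theta_2 + 1.0
    beta_shaped = (beta - 1.0) / theta_2 + 1.0

    samples = rng.beta(alpha_shaped, beta_shaped)
    samples[previous_arm] = -np.inf  # exclude the previously chosen arm
    return int(np.argmax(samples))

# Toy usage: 4 arms with Beta(1, 1) priors plus some observed counts
rng = np.random.default_rng(1)
alpha = np.array([3.0, 1.0, 2.0, 1.0])  # 1 + observed rewards per arm
beta = np.array([1.0, 2.0, 2.0, 1.0])   # 1 + observed failures per arm
arm = wslts_explore_choice(alpha, beta, theta_2=1.0, previous_arm=0, rng=rng)
```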
As opposed to randomly selecting any bandit arm, the probability of selecting the previous arm, in order to break ties between options with the same expected reward, is specifically controlled via the second model parameter θ_1. We provide pseudo-code for the AEG computational model in Algorithm 2.

[Algorithm 2 (AEG pseudo-code), partially recovered: the estimated reward probability of each of the K bandit arms is updated as α_k/(α_k + β_k); M denotes the set of bandit arms with maximal expected reward; the first bandit arm a^(1) is selected uniformly at random; on later trials the agent either re-selects the previous bandit arm a^(t−1) or randomly selects another bandit arm from the set M, with the appropriate probabilities.]
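Since the extracted pseudo-code above is fragmentary, the following Python sketch gives one plausible reading of the AEG choice rule, assuming an ε-greedy scheme in which θ_0 plays the role of ε and θ_1 controls the probability of re-selecting the previous arm when breaking ties among arms with maximal estimated reward. The function and variable names are ours, and the exact rule in the paper's Algorithm 2 may differ.

```python
import numpy as np

def aeg_choice(alpha, beta, theta_0, theta_1, previous_arm, rng):
    """One choice of an AEG-style agent (a sketch, not the paper's exact rule).

    theta_0: probability of a purely random exploratory choice (the epsilon parameter).
    theta_1: probability of re-selecting the previous arm when breaking ties among
             the arms with maximal estimated reward ("stickiness").
    """
    n_arms = len(alpha)

    # With probability theta_0, explore by picking any arm uniformly at random.
    if rng.random() < theta_0:
        return int(rng.integers(n_arms))

    # Otherwise act greedily on the estimated reward probabilities alpha / (alpha + beta).
    estimates = alpha / (alpha + beta)
    best = np.flatnonzero(np.isclose(estimates, estimates.max()))  # the set M

    # Break ties with a bias towards (or against) the previously chosen arm.
    if previous_arm in best:
        if rng.random() < theta_1:
            return int(previous_arm)
        others = best[best != previous_arm]
        if len(others) > 0:
            return int(rng.choice(others))
    return int(rng.choice(best))

# Toy usage with 4 arms and pseudo-counts that include a Beta(1, 1) prior
rng = np.random.default_rng(2)
alpha = np.array([2.0, 2.0, 1.0, 1.0])
beta = np.array([1.0, 1.0, 2.0, 2.0])
arm = aeg_choice(alpha, beta, theta_0=0.1, theta_1=0.8, previous_arm=0, rng=rng)
```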
Generalized Latent State (GLS)
Lee et al. (2011) proposed a latent-state model for bandit tasks whereby a learner can be in either an explore or an exploit state and switch between these as they go through the task. Here, we propose the Generalized Latent State (GLS) model, which unifies and extends the latent-state and latent-switching models previously studied by Lee et al. (2011), allowing for more flexible and structured transitions. The transition distribution over whether the agent is in an exploit state is specified by four parameters, each of which specifies the probability of being in an exploit state at trial t. This transition distribution is denoted by π(l, r^(t−1)), where l is the previous latent state and r^(t−1) the reward observed in the previous trial. We here refer to trials in which a bandit arm was selected but failed to produce a reward as failures. We provide pseudo-code for our GLS computational model in Algorithm 3; a rough code sketch of the transition structure is also given below.

We further characterize individual behavior by visualizing the (marginal) posterior distribution of the AEG model, as shown in Figure 6 for two example participants. The posterior distribution for the first participant (Figure 6) indicates small values of the first AEG parameter θ_0, i.e. the ε parameter, which implies that this participant tends to select the arm with the highest probability of receiving a reward rather than making random exploration decisions. Larger values of the second AEG parameter θ_1, i.e. the "stickiness" parameter, imply that the participant is biased towards re-selecting the previously chosen bandit arm. The second example participant (Figure 6) shows markedly different behavior. As indicated by the marginal posterior of the θ_0 parameter, this participant engages in considerably more random choices. Moreover, this participant shows "anti-sticky" behavior, as implied by low values of the second parameter θ_1. Lastly, we compute the posterior mean for each participant best described by the AEG model and visualize these in a scatter plot in Figure 6. We find that most participants best described by the AEG model show behavior somewhat similar to the first example participant, but, as expected, there is inter-individual variation, e.g. with some participants displaying "anti-sticky" behavior. More generally, the proportions of participants allocated to the WSLTS, AEG and GLS models were 62 (37%), 75 (45%) and 29 (18%), respectively. The finding that the largest proportion of participants were best described by the AEG model differs from prior work, where most participants were best described by Win-Stay Lose-Shift, a special case of our WSLTS model (Steyvers et al., 2009). This suggests that a reinterpretation of previous results is required, where the human tendency to stick with previous choices is less about myopically repeating the last successful choice, and more about balancing reward maximization, exploration, and a drive to be consistent over recent runs of choices. While exploring these findings in detail from a cognitive science perspective is beyond the scope of the present work, we view this as a fruitful avenue for future work.
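As promised above, here is a minimal Python sketch of how the GLS latent-state transition could be implemented, with the four transition parameters arranged as a 2 × 2 table indexed by the previous latent state and the previous reward. The table layout, names, and sampling code are assumptions made for illustration; the explore and exploit choice policies themselves are omitted.

```python
import numpy as np

def gls_next_latent_state(pi, prev_state, prev_reward, rng):
    """Sample whether a GLS-style agent is in the exploit state at trial t.

    pi: 2 x 2 array of exploit probabilities (an assumed layout), indexed by the
        previous latent state (0 = explore, 1 = exploit) and the previous reward
        (0 = failure, 1 = reward); these are the model's four transition parameters.
    """
    p_exploit = pi[prev_state, prev_reward]
    return int(rng.random() < p_exploit)  # 1 = exploit state, 0 = explore state

# Toy usage: an agent that tends to keep exploiting after a rewarded exploit trial
rng = np.random.default_rng(3)
pi = np.array([[0.3, 0.6],   # previously exploring: exploit prob. after failure / reward
               [0.5, 0.9]])  # previously exploiting: exploit prob. after failure / reward
state = gls_next_latent_state(pi, prev_state=1, prev_reward=1, rng=rng)
```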
Figure captions

Figure 2: Schematic of the data-generating process for the example of multi-armed bandit tasks. For a given design d (here the reward probabilities associated with the bandit arms in each individual block) and a sample from the prior θ, the simulator generates observed data y, corresponding to the actions and rewards from B blocks.

Figure 3: Neural network architecture for behavioral experiments. For each block of data we have a small sub-network (shown in blue) that outputs summary statistics S. These are concatenated with the variable of interest v (e.g., corresponding to a model indicator for model discrimination tasks) and passed to a larger neural network (shown in green).

Figure 4: Simulation study results for the model discrimination (MD) task, showing the confusion matrices of the inferred behavioral models, for optimal (left) and baseline (right) designs.

Figure 5: Simulation study results for the parameter estimation (PE) task of the WSLTS model, showing the marginal posterior distributions of the three WSLTS model parameters for optimal (green) and baseline (orange) designs, averaged over 1,000 simulated observations. The ground-truth parameter values were chosen in accordance with previous work on simpler versions of the WSLTS model (Zhang and Lee, 2010) and as plausible population values for the posterior reshaping parameter.

Figure 6: Human-participant study results for the model discrimination (MD) task, showing the distribution of posterior Shannon entropies obtained for optimal (green) and baseline (orange) designs (lower is better).

Figure 8: Human-participant study average linear correlations in the posterior distribution of model parameters for all models in the parameter estimation task, and for both optimal (top) and baseline (bottom) designs.

Appendix Figure 2: Results for the parameter estimation (PE) task of the AEG model, showing the marginal posterior distributions of the three AEG model parameters for optimal (green) and baseline (orange) designs, averaged over 1,000 observations.
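To make the architecture described in the Figure 3 caption more concrete, the following is a minimal PyTorch-style sketch of a per-block summary sub-network whose outputs are concatenated with the variable of interest v and fed to a larger network. All layer sizes, names, the output head, and whether the sub-network is shared across blocks are assumptions for illustration; the actual architecture and the mutual-information-based training objective are described in the main text and the accompanying code.

```python
import torch
import torch.nn as nn

class BlockSummaryNet(nn.Module):
    """Small sub-network applied to each block of behavioral data,
    producing summary statistics S (sizes are assumptions for this sketch)."""
    def __init__(self, block_dim, summary_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(block_dim, 32), nn.ReLU(),
            nn.Linear(32, summary_dim),
        )

    def forward(self, block):
        return self.net(block)

class CriticNet(nn.Module):
    """Larger network that consumes the concatenated per-block summaries
    together with the variable of interest v (e.g., a model indicator)."""
    def __init__(self, n_blocks, block_dim, v_dim, summary_dim=8):
        super().__init__()
        self.summary = BlockSummaryNet(block_dim, summary_dim)
        self.head = nn.Sequential(
            nn.Linear(n_blocks * summary_dim + v_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, blocks, v):
        # blocks: (batch, n_blocks, block_dim); v: (batch, v_dim)
        summaries = self.summary(blocks)                # (batch, n_blocks, summary_dim)
        flat = summaries.flatten(start_dim=1)           # concatenate block summaries
        return self.head(torch.cat([flat, v], dim=-1))  # scalar output per example

# Toy usage
net = CriticNet(n_blocks=4, block_dim=20, v_dim=3)
blocks = torch.randn(16, 4, 20)
v = torch.randn(16, 3)
score = net(blocks, v)  # shape (16, 1)
```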